Estimating home values is an important part of the Windfall net worth model. Property often represents a significant portion of an individual’s net worth and, as home values change, can be a major driver of wealth growth.
Your task is to create a model using the attached dataset to attempt to predict home values in Denver.
The dataset contains information on approximately 3,000 single-family homes in Denver, CO. Each row contains the following information about the property that can be used to estimate the home’s value:
In addition, we have included the Zillow “Zestimate” (http://www.zillow.com/zestimate) of each home’s current value in the estimated_value column of the attached file. You should use these estimates to train your model.
We have reserved a set of single family home records from Denver that we will use to assess the quality of your model. The reserve dataset is in the same format as the training set.
Please implement a model using either python or R. When you are finished, send us back:
Here are some points you may want to cover when writing your explanation:
In [1]:
import pandas as pd
import numpy as np
import scipy.stats
%pylab inline
In [2]:
csv = pd.read_csv("single_family_home_values.csv", parse_dates=["last_sale_date"])
In [3]:
print csv.shape
csv.head()
Out[3]:
In [4]:
#scale the data
from sklearn import preprocessing
from scipy import stats
In [5]:
# fill missing values (0's) w/ the median for the column
cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built",
"last_sale_amount", "estimated_value"]
for col in cols:
csv.loc[csv[col]==0, col] = csv[col].median()
In [6]:
csv.head(4)
Out[6]:
Although location is obviously a very important characteristic of home value, we cannot use it in its current form in this model.
Zip code is a categorical variable, so we would have to one-hot encode it, which would add roughly 30 dimensions. With a training set this small, that extra dimensionality would add a lot of variance to the model; in other words, it would drastically overfit.
A better solution, outside the scope of this exercise, would be to geocode each address to a lat/lng pair. As lat/lng, location becomes two continuous features, adding only 2 dimensions while providing a lot of value.
For now we have to ignore the location information.
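To make the dimensionality argument concrete, here is a minimal sketch; the zipcode column name and the geocode() helper are assumptions for illustration, not columns or functions from this notebook:

# Hypothetical illustration -- assumes the raw file has a "zipcode" column
zip_dummies = pd.get_dummies(csv["zipcode"], prefix="zip")
print(zip_dummies.shape)  # one extra column per zip code, roughly 30 for Denver

# A geocoded representation would add only two continuous columns instead, e.g.
# csv["lat"], csv["lng"] = geocode(csv["address"])  # geocode() is a hypothetical helper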
P.S.
One of my favorite uses of zip code data is to look up demographic variables by zip code that may not otherwise be available at the individual level...
This feature is useless, so I'm removing it...
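As a sketch of the zip-level demographics idea from the P.S., a zip-level table could simply be joined onto the property records; the file name and column names below are hypothetical:

# Hypothetical: join zip-level demographic aggregates onto each property record
zip_demo = pd.read_csv("zip_demographics.csv")  # e.g. zipcode, median_income, population_density
enriched = csv.merge(zip_demo, on="zipcode", how="left")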
In [154]:
model_cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built",
"last_sale_amount", "last_sale_date", "estimated_value"]
housing = csv[model_cols].copy()  # copy so new columns can be added without touching csv
housing.head()
Out[154]:
In [155]:
housing["day"] = (housing.last_sale_date - pd.datetime(1900,1,1))/np.timedelta64(1,'D')
housing.drop("last_sale_date", 1, inplace=True)
In [156]:
housing.head(5)
Out[156]:
The column num_rooms is the total number of rooms in the property, including bedrooms and baths, so there is some overlap between it and the num_bedrooms and num_baths columns. We could subtract the sum of num_bedrooms and num_baths from num_rooms to remove that redundancy, essentially transforming the column into an other_room_count.
Note
I don't think I fully understand this column, because example id 3393 (see below) has fewer total rooms than bedrooms + baths. I'm just going to skip this transformation for now (a sketch of it follows the next cell).
In [163]:
housing.iloc[3393]
Out[163]:
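For reference, a minimal sketch of what the skipped transformation would look like, including a check for inconsistent records like id 3393; clipping at zero is one possible handling and is an assumption, not something applied in this notebook:

# Sketch only -- not applied to the modeling data
other_rooms = housing["num_rooms"] - (housing["num_bedrooms"] + housing["num_baths"])
print((other_rooms < 0).sum())  # number of records like id 3393 with more bedrooms+baths than rooms
# other_rooms = other_rooms.clip(lower=0)       # one way to handle the inconsistent rows
# housing["other_room_count"] = other_rooms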
In [10]:
# drop rows where any column is more than 3 standard deviations from its mean
housing_sans_outliers = housing[(np.abs(stats.zscore(housing)) < 3).all(axis=1)]
print housing.shape
print housing_sans_outliers.shape
In [11]:
housing_sans_outliers.head(5)
Out[11]:
In [12]:
#check correlations
from pandas.tools.plotting import scatter_matrix
_=scatter_matrix(housing_sans_outliers, alpha=0.2, figsize=(17, 17), diagonal='kde')
Loss minimizers/optimizers usually perform better when the variables have means and standard deviations that are not too far apart. A log() or sqrt() transform usually reduces the scale of the variables so that they are closer to each other.
I log-transformed the features whose skew was greater than 0.75. This makes them more normally distributed, which helps linear regression perform better, since linear regression is sensitive to outliers. Note that if I used a tree-based model I wouldn't need to transform the variables.
In [13]:
#log transform skewed numeric features:
numeric_feats = housing_sans_outliers.dtypes[housing_sans_outliers.dtypes != "object"].index
skewed_feats = housing_sans_outliers[numeric_feats].apply(lambda x: stats.skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
print skewed_feats
housing_sans_outliers[skewed_feats] = np.log1p(housing_sans_outliers[skewed_feats])
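Note that estimated_value should be among the log1p-transformed columns (home values are strongly right-skewed, and the error calculation at the end of the notebook assumes it is), so any dollar-scale error reported later has to invert the transform with np.expm1 rather than np.exp. A quick round-trip check on the untransformed column in csv:

# expm1 is the exact inverse of log1p
vals = csv["estimated_value"].values
print(np.allclose(np.expm1(np.log1p(vals)), vals))  # True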
In [14]:
housing_sans_outliers.head()
Out[14]:
In [15]:
# scaled_sq_footage = preprocessing.scale(csv[["square_footage"]])
_ = housing_sans_outliers[["square_footage"]].boxplot(return_type='dict')
_ = housing_sans_outliers[["square_footage"]].hist(bins=50)
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
We know that scaling is beneficial for linear models. The skew correction above already rescaled things in essence, but it didn't run for every feature, so we could scale all features here for good measure (in the end, experimentation showed it didn't help much, so it is left commented out in the next cell).
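If scaling were kept, a leakage-free way to do it is to fit the scaler on the training split only and reuse it on the test split. A minimal sketch, using the X_train/X_test arrays created a few cells below:

# Sketch: fit the scaler on training data only, then apply it to both splits
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)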
In [105]:
final = housing_sans_outliers[["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built",
"last_sale_amount", "day", "estimated_value"]]
# through experimentation scaling doesn't help much
# scaled = preprocessing.scale(final.values)
scaled = final.values
features = scaled[:,0:-1]
labels = scaled[:,-1]
Let's create a training and test set
In [106]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import ensemble
from sklearn.metrics import r2_score
In [107]:
#test train split
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.25, random_state=4)
In [111]:
# pick the Lasso regularization strength by cross-validation
lassocv = linear_model.LassoCV(alphas=[0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75])
lassocv.fit(features, labels)
lassocv.alpha_
Out[111]:
In [190]:
lasso = linear_model.Lasso(alpha=lassocv.alpha_)
lasso.fit(X_train, Y_train)
lasso.score(X_test, Y_test)
Out[190]:
As you can see from the output above, Lasso scores an R^2 of 0.576. Let's check which features it found important and then compare with Ridge:
In [113]:
training_cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built",
"last_sale_amount", "day"]
coef = pd.Series(lasso.coef_, index = training_cols)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
imp_coef = coef.sort_values()
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
Out[113]:
Some of these make sense: last sale amount, number of baths, and number of rooms are intuitively related to home price. However, it is surprising that it discarded lot size as having no importance. This can be seen as a limitation of Lasso regularization; let's see if Ridge can overcome it:
In [108]:
regcv = linear_model.RidgeCV(alphas=[0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75])
regcv.fit(features, labels)
regcv.alpha_
Out[108]:
In [191]:
reg = linear_model.Ridge(alpha=regcv.alpha_)
reg.fit(X_train, Y_train)
ridge_preds = reg.predict(X_test)
reg.score(X_test, Y_test)
Out[191]:
Nice! We have a marked improvement over Lasso!
In [250]:
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [100, 300, 500], 'max_depth': [2, 5, 10], 'min_samples_split': [2, 4, 8],
              'learning_rate': [0.1, 0.5]}
gbr = ensemble.GradientBoostingRegressor()
gscv = GridSearchCV(gbr, parameters)  # exhaustive grid search with cross-validation
gscv.fit(features, labels)
Out[250]:
In [251]:
gscv.best_estimator_
Out[251]:
In [252]:
# use the best configuration found by the grid search, refit on the training split only
gbr = gscv.best_estimator_
gbr.fit(X_train, Y_train)
gbr.score(X_test, Y_test)
Out[252]:
In [242]:
0.76074209487440014
Out[242]:
In [115]:
gb_preds = gbr.predict(X_test)
And there you have it: our blended model performs better (76.63%) than its individual parts (76.22% and 57.60%).
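The blend behind that number isn't shown in a cell above; presumably it was a simple average of the Ridge and GBR test-set predictions, something like the sketch below (the 50/50 weighting is an assumption):

# Simple 50/50 average of the Ridge and GBR predictions on the test split
blend_preds = (ridge_preds + gb_preds) / 2.0
print(r2_score(Y_test, blend_preds))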
Now let's formalize the blended model with mlxtend's StackingRegressor and run evaluation tests:
In [200]:
from mlxtend.regressor import StackingRegressor
In [253]:
meta = linear_model.LinearRegression()
blender = StackingRegressor(regressors=[reg, gbr], meta_regressor=meta)
_=blender.fit(X_train, Y_train)
y_pred = blender.predict(X_test)
blender.score(X_test, Y_test)
Out[253]:
In [228]:
from sklearn.model_selection import cross_val_score
In [254]:
scores = cross_val_score(blender, features, labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
As you can see, the cross-validated R^2 is about 0.78, which is very respectable.
Let's also take the average difference between actual and predicted values: on this test split it is about 18% error, which is also respectable.
In [255]:
abs_diff = np.abs(np.expm1(Y_test) - np.expm1(y_pred))  # back to dollar scale (inverse of log1p)
mean_diff = np.mean(abs_diff)
p_mean_diff = np.mean(abs_diff / np.expm1(Y_test))       # mean absolute percentage error
print "Mean Error:\t %.0f/%0.3f%%" % (mean_diff, p_mean_diff*100)